13 research outputs found

    Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering

    Coupled with the availability of large-scale datasets, deep learning architectures have enabled rapid progress on the Question Answering task. However, most of those datasets are in English, and the performance of state-of-the-art multilingual models is significantly lower when evaluated on non-English data. Due to high data collection costs, it is not realistic to obtain annotated data for every language one wishes to support. We propose a method to improve Cross-lingual Question Answering performance without requiring additional annotated data, leveraging Question Generation models to produce synthetic samples in a cross-lingual fashion. We show that the proposed method significantly outperforms baselines trained on English data only. We report a new state of the art on four multilingual datasets: MLQA, XQuAD, SQuAD-it and PIAF (fr).
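
    As a hedged illustration of the approach described above, the sketch below generates a synthetic question for a given answer span with an off-the-shelf Question Generation model. The checkpoint name and the <hl> highlight format are assumptions (a common community convention), not the authors' exact pipeline.

```python
# Minimal sketch of QG-based data augmentation, not the authors' released code.
# The checkpoint and its <hl>-highlight input format are assumptions.
from transformers import pipeline

qg = pipeline("text2text-generation", model="valhalla/t5-base-qg-hl")

def synthesize_qa(context: str, answer: str) -> dict:
    """Build one synthetic (context, question, answer) training sample."""
    # Mark the answer span so the QG model knows what to ask about.
    highlighted = context.replace(answer, f"<hl> {answer} <hl>", 1)
    question = qg(f"generate question: {highlighted}")[0]["generated_text"]
    return {"context": context, "question": question, "answer": answer}

sample = synthesize_qa("The Eiffel Tower was completed in 1889 in Paris.", "1889")
print(sample["question"])  # e.g. "When was the Eiffel Tower completed?"
```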

    Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark

    We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages. The overarching goal of UNER is to provide high-quality, cross-lingually consistent annotations to facilitate and standardize multilingual NER research. UNER v1 contains 18 datasets annotated with named entities in a cross-lingually consistent schema across 12 diverse languages. In this paper, we detail the dataset creation and composition of UNER; we also provide initial modeling baselines in both in-language and cross-lingual learning settings. We release the data, code, and fitted models to the public.
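
    Since UNER is first and foremost a data resource, a minimal reader is sketched below. It assumes the common two-column CoNLL convention (token, tab, IOB2 tag); the file path is hypothetical, and this is not a documented UNER API.

```python
# Hypothetical reader for a UNER-style two-column file (token <TAB> IOB2 tag).
def read_iob2(path):
    """Yield (tokens, tags) pairs, one pair per sentence."""
    tokens, tags = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("#"):      # sentence-level metadata, skip
                continue
            if not line:                  # blank line ends a sentence
                if tokens:
                    yield tokens, tags
                    tokens, tags = [], []
                continue
            token, tag = line.split("\t")[:2]
            tokens.append(token)
            tags.append(tag)
    if tokens:                            # handle a missing trailing blank line
        yield tokens, tags

# Hypothetical path; UNER v1 annotates PER, ORG, and LOC entities.
for tokens, tags in read_iob2("uner_eng_ewt/train.iob2"):
    print(list(zip(tokens, tags)))
    break
```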

    Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios?

    Recent impressive improvements in NLP, largely based on the success of contextual neural language models, have been mostly demonstrated on at most a couple dozen high-resource languages. Building language models and, more generally, NLP systems for non-standardized and low-resource languages remains a challenging task. In this work, we focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi, found mostly on social media and messaging communication. In this low-resource scenario with data displaying a high level of variability, we compare the downstream performance of a character-based language model on part-of-speech tagging and dependency parsing to that of monolingual and multilingual models. We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank of this language leads to performance close to that obtained with the same architecture pre-trained on large multilingual and monolingual models. Confirming these results on a much larger dataset of noisy French user-generated content, we argue that such character-based language models can be an asset for NLP in low-resource and high language variability settings.
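
    The sketch below illustrates the general recipe of fine-tuning a character-level encoder for POS tagging. It uses google/canine-s purely as a stand-in character-based model; the paper's own architecture, training data, and the NArabizi treebank are not reproduced here.

```python
# Character-level POS tagging sketch, with CANINE as a stand-in model.
import torch
from transformers import AutoTokenizer, CanineForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("google/canine-s")
model = CanineForTokenClassification.from_pretrained(
    "google/canine-s", num_labels=17  # the 17 Universal Dependencies UPOS tags
)

# CANINE operates on Unicode code points, so logits come out per character;
# word-level POS tags would still require a character-to-word alignment step.
enc = tokenizer("3lach rak zaafan khouya", return_tensors="pt")  # NArabizi-like
with torch.no_grad():
    logits = model(**enc).logits
print(logits.shape)  # (1, sequence_length_in_characters, 17)
```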

    Enriching the NArabizi Treebank: A Multifaceted Approach to Supporting an Under-Resourced Language

    In this paper, we address the scarcity of annotated data for NArabizi, a Romanized form of North African Arabic used mostly on social media, which poses challenges for Natural Language Processing (NLP). We introduce an enriched version of the NArabizi Treebank (Seddah et al., 2020) with three main contributions: the addition of two novel annotation layers (named entity recognition and offensive language detection) and a re-annotation of the tokenization, morpho-syntactic and syntactic layers that ensures annotation consistency. Our experimental results, using different tokenization schemes, showcase the value of our contributions and highlight the impact of working with non-gold tokenization for NER and dependency parsing. To facilitate future research, we make these annotations publicly available. Our enhanced NArabizi Treebank paves the way for creating sophisticated language models and NLP tools for this under-represented language.
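
    Treebanks like this one are typically distributed in CoNLL-U format; the sketch below inspects one with the `conllu` library (pip install conllu). The file name is hypothetical.

```python
# Inspect the annotation layers of a CoNLL-U treebank (hypothetical file name).
from conllu import parse_incr

with open("narabizi_train.conllu", encoding="utf-8") as f:
    for sentence in parse_incr(f):
        for token in sentence:
            # FORM, UPOS, HEAD and DEPREL are among the layers re-annotated
            # for consistency in the enriched treebank.
            print(token["form"], token["upos"], token["head"], token["deprel"])
        break  # first sentence only
```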

    Multilingual Auxiliary Tasks Training: Bridging the Gap between Languages for Zero-Shot Transfer of Hate Speech Detection Models

    Zero-shot cross-lingual transfer learning has been shown to be highly challenging for tasks involving a lot of linguistic specificities or when a cultural gap is present between languages, such as in hate speech detection. In this paper, we highlight this limitation for hate speech detection in several domains and languages using strict experimental settings. Then, we propose to train on multilingual auxiliary tasks -- sentiment analysis, named entity recognition, and tasks relying on syntactic information -- to improve zero-shot transfer of hate speech detection models across languages. We show how hate speech detection models benefit from a cross-lingual knowledge proxy brought by auxiliary-task fine-tuning and highlight these tasks' positive impact on bridging the hate speech linguistic and cultural gap between languages.
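
    A common way to implement this kind of auxiliary-task training is a shared encoder with one classification head per task. The sketch below is a schematic re-implementation under that assumption, not the authors' released code; checkpoint, task names, and label counts are illustrative.

```python
# Multi-task fine-tuning skeleton: shared encoder, one classifier head per task.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultiTaskModel(nn.Module):
    def __init__(self, encoder_name, task_num_labels):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden, n) for task, n in task_num_labels.items()}
        )

    def forward(self, task, **inputs):
        # [CLS] pooling suits sentence-level tasks; token-level tasks such
        # as NER would use the full hidden-state sequence instead.
        pooled = self.encoder(**inputs).last_hidden_state[:, 0]
        return self.heads[task](pooled)

model = MultiTaskModel(
    "bert-base-multilingual-cased",
    {"hate_speech": 2, "sentiment": 3, "ner": 7},  # illustrative label counts
)
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
batch = tok(["an example sentence"], return_tensors="pt")
print(model("sentiment", **batch).shape)  # (1, 3)
```

    During training, batches from the hate speech task and the auxiliary tasks would be interleaved so that the shared encoder absorbs the cross-lingual signal the paper describes.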

    Multilingual Auxiliary Tasks for the Transfer of Hate Speech Detection Models

    Hate speech detection is a difficult task, as it requires extensive cultural and contextual knowledge; the knowledge needed varies, among other factors, with the language of the speaker or the target of the content. However, annotated data for specific domains and languages are often absent or limited. This is where data in other languages can be exploited, but because of these variations, cross-lingual transfer is often difficult. In this article, we highlight this limitation for several domains and languages and show the positive impact of training on multilingual auxiliary tasks -- sentiment analysis, named entity recognition, and tasks relying on morpho-syntactic information -- on the zero-shot cross-lingual transfer of hate speech detection models, in order to bridge this cultural gap.

    Analyzing Zero-Shot Transfer Scenarios across Spanish Variants for Hate Speech Detection

    Hate speech detection in online platforms has been widely studied in the past. Most of these works were conducted in English and a few high-resource languages. Recent approaches tailored for low-resource languages have explored the interest of zero-shot cross-lingual transfer learning models in resource-scarce scenarios. However, language variation between geolects such as American English and British English, or Latin-American Spanish and European Spanish, is still a problem for NLP models that often rely on (latent) lexical information for their classification tasks. More importantly, the cultural aspect, crucial for hate speech detection, is often overlooked. In this work, we present the results of a thorough analysis of hate speech detection models' performance on different variants of Spanish, including a new Twitter dataset of hate speech toward immigrants that we built to cover these variants. Using mBERT and BETO, a monolingual Spanish BERT-based language model, as the basis of our transfer learning architecture, our results indicate that hate speech detection models for a given Spanish variant are affected when different variations of that language are not considered. Hate speech expressions can vary from region to region where the same language is spoken.
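
    The paper's transfer architecture builds on mBERT and BETO; as a minimal, hedged sketch, both public checkpoints can be instantiated behind a binary detection head as below (the classification head itself is schematic and untrained).

```python
# Instantiate the two encoder backbones behind a binary hate speech head.
from transformers import AutoModelForSequenceClassification

encoders = {
    "mBERT": "bert-base-multilingual-cased",
    "BETO": "dccuchile/bert-base-spanish-wwm-cased",
}
models = {
    name: AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2)
    for name, ckpt in encoders.items()
}
# Fine-tune on one Spanish variant, then evaluate zero-shot on the others
# to reproduce the cross-variant comparison described above.
```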

    Fine-tuning and Sampling Strategies for Multimodal Role Labeling of Entities under Class Imbalance

    We propose our solution to the multimodal semantic role labeling task from the CONSTRAINT’22 workshop. The task aims at classifying entities in memes into classes such as “hero” and “villain”. We use several pre-trained multimodal models to jointly encode the text and image of the memes, and implement three systems to classify the role of the entities. We propose dynamic sampling strategies to tackle the issue of class imbalance. Finally, we perform qualitative analysis on the representations of the entities.
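
    One standard remedy in the spirit of the dynamic sampling described above is inverse-frequency sampling; the sketch below uses PyTorch's WeightedRandomSampler and does not reproduce the authors' exact dynamic schedule.

```python
# Oversample minority classes with per-sample inverse-frequency weights.
import torch
from collections import Counter
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

labels = torch.tensor([0, 0, 0, 0, 0, 1, 1, 2])  # e.g. other / hero / villain
counts = Counter(labels.tolist())
weights = torch.tensor([1.0 / counts[int(y)] for y in labels])

sampler = WeightedRandomSampler(weights, num_samples=len(labels),
                                replacement=True)
dataset = TensorDataset(torch.arange(len(labels)), labels)
loader = DataLoader(dataset, batch_size=4, sampler=sampler)

for idx, y in loader:
    print(y.tolist())  # minority classes now appear far more often
```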